Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST

Spatial transcriptomics technologies generate gene expression profiles with spatial context, requiring spatially informed analysis tools for three key tasks, spatial clustering, multisample integration, and cell-type deconvolution. We present GraphST, a graph self-supervised contrastive learning method that fully exploits spatial transcriptomics data to outperform existing methods. It combines graph neural networks with self-supervised contrastive learning to learn informative and discriminative spot representations by minimizing the embedding distance between spatially adjacent spots and vice versa. We demonstrated GraphST on multiple tissue types and technology platforms. GraphST achieved 10% higher clustering accuracy and better delineated fine-grained tissue structures in brain and embryo tissues. GraphST is also the only method that can jointly analyze multiple tissue slices in vertical or horizontal integration while correcting batch effects. Lastly, GraphST demonstrated superior cell-type deconvolution to capture spatial niches like lymph node germinal centers and exhausted tumor infiltrating T cells in breast tumor tissue.

Overall, the topic is important and timely. While this method has strong potential in application, the manuscript lacks clarity on the details of the method, and it is unclear what unique advantages the method has compared with other existing methods in terms of its performance. Here are several specific major points in addition to multiple minor point for improving the work. Below are detailed comments: Major concerns: 1. Lines 123-124, "To our knowledge, few existing methods use self-supervised contrastive learning on spatial transcriptomics". While the work may be new in using self-supervised contrastive learning on spatial transcriptomics, there are some papers that use such kind of methods to address similar questions. For example, 1) Ren, H., Walker, B.L., Cang, Z. et al. Identifying multicellular spatiotemporal organization of cells with SpaceFlow. Nat Commun 13, 4076 (2022). https://doi.org/10.1038/s41467-022-31739w 2) conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics.Yongshuo Zong, Tingyang Yu, Xuesong Wang, Yixuan Wang, Zhihang Hu, Yu Li bioRxiv 2022.01.14.476408; doi: https://doi.org/10.1101/2022.01.14.476408. What is the novelty of the proposed method compared to those two methods? 2. I tried to test and reproduce results shown in this manuscript using the provided links in the manuscript (https://deepst-tutorials.readthedocs.io/), and observed the following problem: a. DeepST/utils.py:121, in refine_label(adata, radius, key) 120 for j in range(1, n_neigh+1): --> 121 neigh_type.append(old_type[index[j]]) IndexError: index 3602 is out of bounds for axis 0 with size 3583 Basically, in the clustering step (In refine_label function), the shape of old_type and index are inconsistent, which might be caused by the inconsistent size between the dimensions from adata and distance matrix. As a result, I couldn't reproduce the results shown in the paper. See the attached ` DeepST_test.ipynb` for details. 3. Regarding to the method, especially the contrastive learning component, both the formula and the ideas are very similar to Deep Graph Infomax (DGI) (Veličković et al. 2018). What are the novel elements and major differences between current method and DGI? This needs to be addressed. 4. What are the meaning and motivation of formula (5), line 734? It's important to show the performance difference with and without adding this term by experiment, because DGI only contains (4) instead of (5). Does this term actually improve the performance? 5. The reason of using reconstructed expression data to cluster instead of using the latent embedding need to be justified. Moreover, why choosing mclust over graph-based methods such as Leiden, Louvain? It's important to justify such choices in terms of data analysis and results. 6. In line 684, do the authors augment data through creating corrupted graph by randomly adding or dropping edges? What is the effect of such procedure on the overall performance of the method. 7. Regarding the data integration performance shown in Figure 4, why did the authors not compare many other methods designed for nonspatial scRNA-seq data, such as scVI (Lopez et al. 2018) and Harmony (Korsunsky et  Minor points 9. To better support the manual annotation result in Fig 6A, the spatial expression distribution of several marker genes for each panels in Fig 6A need to be added. 10. In line 836, the author mentioned the first loss term indicates contrastive loss, why is there only one instead of two terms? What is the meaning of the first term? 11. In line 712, which norm is used? L1 or L2 or others? 12. In Fig 3C, the titles of panels Mesenchyme and Dermomyotome seem misplaced. 13. All color bars need to be explained for their meanings. 14. Many typos and grammar errors in the manuscript, e.g., in line 28, "has" should be "have"; in line 59, "K-means" should be "k-means" 15. Lines 124-126, "Using self-supervised contrastive learning improves performance in learning relevant latent features and has the additional benefit of removing batch effects". This sentence occurs without any supporting evidence. It needs to be fixed. 16. In lines 158-162, the authors introduced "self-reconstruction loss" and "contrastive loss" and their effects. It's important to show what the two losses are in the context of biology. 17. It's unclear how the neighbor graph is constructed. In the caption of Fig. 1 (lines 1034-1035), the authors wrote "…neighbor graph constructed using spot coordinates (x,y) of that fall within a distance threshold". However, in the method section in lines 665-675, the authors wrote "Finally, we select the top k-nearest spots as its neighbors". It's unclear whether the authors used a distance threshold or a threshold for k. 18. Regarding the method (lines 655-659), the descriptions seem to be for the spatial transcriptomics data. However, this is not clear from the description, as two kinds of datasets (spatial transcriptomics data and scRNA-seq data) are mentioned in this paper. 19. In lines 684-688, "…while keeping the original graph structure unchanged": was the corrupted neighbor graph G' the same as the original G? 20. In lines 708-709, "W_d and b_d represent the trainable weight matrix and bias vector, respectively, which are shared by all nodes in the graph". Please justify why W_d and b_d need to be shared by all nodes in the graph. Besides, is this the same case for W_e and b_e? Reviewer #2: Remarks to the Author: In the manuscript "DeepST: A novel graph …." by Long and coworkers the authors develop a new method for better describing spatial transcriptomics data and being able to integrate multiple studies. The method is based on graph neural networks and contrastive learning, which makes it possible to combine scRNA-seq of better resolution and spatial transcriptomics. The method is logically sound and makes a lot of sense. Moreover, authors show that it empirically identifies more relevant clusters and allows data integration for higher power. Although, I am not an expert in spatial transcriptomics these problems seem of great importance and authors spend good effort to show that it works a planned. Having said that my expertise is in neural networks and translational bioinformatics I believe that the paper would be a good contribution to the spatial transcriptomics field. From my side, I have no concerns of the paper and like to see it published.

20
With rapid technological advances in spatial transcriptomics, it is now widely applied towards studying 21 tissue complexity and cell-cell communications. However, the current bottleneck still lies in data 22 analysis. Although multiple methods have been developed for spatial transcriptomics, there is still a 23 great need for developing novel tools that offer greater accuracy, robustness, and generalizability 24 towards a wide range of application on different tissue types and technology platforms. Furthermore, 25 the analysis pipeline for spatial transcriptomic data comprises three key tasks, namely spatial clustering, 26 multi-sample integration, and cell type deconvolution. However, there is no comprehensive tool that can 27 perform all these three tasks. To overcome this limitation, we developed GraphST, the first of its kind 28 that integrates these tasks into a streamlined process. Most importantly, GraphST outperforms existing 29 methods in each task. We achieved this by adopting and tailoring graph self-supervised contrastive 30 learning for spatial transcriptomics analysis.

31
In the spatial clustering task, we achieved higher accuracy and robustness with an average of 10% 32 improvement over the best of existing methods in a variety of datasets. Our GraphST clusters revealed 33 finer tissue structures and niches in complex tissues such as the brain, olfactory bulb, and embryo.

34
Although existing methods conST and SpaceFlow also adopted graph contrastive learning for spatial

43
In the multi-sample integration task, GraphST can better correct batch effects when integrating serial 44 tissue slices than existing methods that have been developed for spatial (e.g., STAGATE) or non-spatial 45 batch integration (e.g., scVI and Harmony). Moreover, for the horizontal integration of mouse anterior 46 and posterior brain slices, GraphST outperformed SpaGCN and STAGATE in that GraphST could 47 assign the common cortical layers that aligned well across the shared edge and also reveal the dorsal 48 and ventral horns of the hippocampus regions.

49
In the final task, GraphST produced more accurate cell type deconvolution with simulation data than 50 existing methods, including cell2location that was recognized as the top performing method in a recent 51 benchmark. Moreover, when applied to 10x Visium acquired human lymph node data, GraphST's with higher spatial coherence. Lastly, application on human breast cancer 10x Visium data revealed

69
What is the novelty of the proposed method compared to those two methods? 70 Response 1.1: Thank you very much for raising this critical issue. Indeed, GraphST bears much 71 similarity to conST and SpaceFlow with the use of graph contrastive learning for spatial clustering.

72
However, there are several major differences between GraphST and the other two methods.

73
First, and most importantly, both conST and SpaceFlow were mainly developed for spatial clustering 74 only. In addition to spatial clustering, GraphST can be also applied to two other important ST data 75 analysis tasks, multi-sample integration and cell type deconvolution of ST. GraphST comprises three 76 modules with different network architectures tailored for each of the three tasks respectively.

77
Secondly, even for the spatial clustering task, there are also major differences when comparing

120
Based your comments, we compared the performance of conST, SpaceFlow, and GraphST in the 121 spatial clustering task with the DLPFC dataset. Figure R2  We ran GraphST and the variant on the 12 DLPFC slices and evaluated their performance using their 137 median ARI scores. Figure R4 shows that GraphST outperforms the variant (median ARI score of 0.51) 138 with a significantly higher median ARI score of 0.60. This ablation study demonstrated that local context 139 does help GraphST perform better than with the global context. We have added these results to

140
Supplementary Figure S14A in the revised manuscript.

141
To demonstrate the advantage of symmetric contrastive loss over single contrastive loss, we conducted 142 another ablation study to compare GraphST with a variant that does not use the contrastive corrupted 143 loss (formula (5) in the manuscript). We tested GraphST and this variant on the 12 DLPFC samples 144 and evaluated their performance with the ARI metric. Figure R5 shows that GraphST achieved much

226
Based on your comments, we conducted an ablation study to validate the effectiveness of symmetric 227 contrastive loss. We compared GraphST with a variant without contrastive corrupted loss on the 12 228 DLPFC samples and used the ARI metric for evaluation. The results in Figure R5 show that GraphST 229 without contrastive corrupted loss achieved a lower median ARI score (0.47) than the original GraphST 230 (0.60), supporting the idea that symmetric contrastive loss helps our model achieve better performance.

231
Furthermore, Figure R7 shows that the model training curve is stabilized with symmetric contrastive 232 loss. These results have been included in Figure S14B in the Supplementary.

254
In response to your comments, we compared the clustering performance of using the latent 255 representation and the reconstructed expression on the 12 DLPFC samples. The results in Figure R8 256 show that GraphST achieved much a higher median ARI score when using the reconstructed 257 expression for clustering than the latent representation, suggesting that the former contains more useful

291
During data augmentation, the distribution of the augmented data should be distinguishable from the 292 original data. Otherwise, it can easily cause the model to overfit. Therefore, we adopted random feature 293 swapping to perturb the original data as much as possible, such that the model can learn more useful 294 information from the spatial data.

300
Response 1.7: Thank you for your insightful comments. Following your comments, we added scVI and

301
Harmony to our tests on the two mouse breast cancer datasets for vertical integration ( Figure R11).

302
Both scVI and Harmony were able to mix the two slices, but some batch differences were still visible 303 post integration ( Figure R11B). In comparison, GraphST evenly mixed the two slices, achieving better 304 batch mixing than scVI and Harmony. We also quantitatively evaluated batch mixing with the iLISI metric 305 where the higher iLISI score, the better the batch mixing. GraphST achieved a much higher iLISI score 306 than Harmony and scVI ( Figure R11C), confirming our visual observations. In the post-integration 307 clustering, Harmony failed to align clusters across the two slices. scVI performed better than Harmony 308 but some clusters were still not accurately mapped, such as clusters 1, 5, and 10. In contrast, GraphST's 309 clusters highly overlapped between the two slices.

310
We also tested scVI and Harmony on one more mouse breast cancer sample ( Figure R11D-F). Batch 311 differences remained visible on the Harmony UMAP plot ( Figure R11E). Comparatively, scVI removed 312 the batch effects much better than Harmony. GraphST performed the best by evenly mixing the two 313 slices. In terms of iLISI, Harmony significantly underperformed GraphST while scVI was comparable to 314 GraphST ( Figure R11F). Most of Harmony's clusters did not match across the two slices. While scVI 315 generated clusters that were more consistent than Harmony, some clusters were fragmented, 316 especially in section 2. GraphST again identified clusters that were spatially coherent and aligned well 317 across the two slices. These results have been added to Figure 4 of the revised manuscript. highlighted with white boxes on the H&E images. In contrast, GraphST was able to reveal these regions.

331
Overall, compared with STAGATE, GraphST performed slightly better in the horizontal integration task.

332
We have added STAGATE's results to Figure 4 of the revised manuscript.

358
(2022), we use an augmentation-free contrastive learning method for the third module. Therefore, there 359 is only one term for contrastive loss. The first term (i.e., contrastive loss) in formula (10)

407
The contrastive loss design is based on the assumption that a spot in the spatial data usually has a cell 408 type label similar to its local context, e.g., one-hop or two-hop neighbours. As discussed above in the 409 contrastive learning part, we define positive/negative pairs by pairing spot embedding ℎ /ℎ from the 410 original/corrupted graph with its local summary vector . The local summary vector represents 411 the local context of a spot and is obtained by a sigmoid of the mean of all its neighbours' embeddings.

412
The main goal of contrastive learning is to make spot embedding ℎ close to its local context from 413 the original graph. Therefore, trained with contrastive loss, spatially adjacent spots will have similar 414 embeddings while non-adjacent spots will have dissimilar embeddings.

415
Based on your comment, we have added more biological contexts when describing self-reconstruction 416 loss and contrastive loss in the revised manuscript.

430
Response 1.18: Thank you for your comments. GraphST was developed for three analysis tasks, 431 spatial clustering, multiple ST data integration, and cell type deconvolution of spatial data. Cell type 432 deconvolution is achieved by projecting single-cell RNA-seq data onto the spatial data. Therefore, in 433 addition to spatial data, single cell RNA-seq data is also used in the third module of our framework.